
Conversation

Angazenn
Contributor

@Angazenn Angazenn commented Sep 29, 2025

What this PR does / why we need it?

This PR adds padding logic for seq_lens and block_tables when running in a full decode scenario. Before this PR, the number of input tokens after padding could exceed what the corresponding seq_lens describes. For example, in a full decode scenario:

input_ids : [1, 3, 0, 0]
seq_lens: [2, 1]
query_start_loc: [0, 1, 2]

Here, input_ids is padded by 2 tokens while seq_lens/query_start_loc are not. The mismatch between input_ids and seq_lens/query_start_loc may cause subtle bugs. This PR changes it into:

input_ids : [1, 3, 0, 0]
seq_lens: [2, 1, 1, 1]
query_start_loc: [0, 1, 2, 3, 4]
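
A minimal sketch of this padding, assuming PyTorch tensors and the made-up values from the example above (the real change lives in vllm_ascend/attention/attention_v1.py); each padded token is treated as a dummy decode request of sequence length 1:

    import torch

    # Illustrative values from the example above (not the real metadata objects).
    seq_lens = torch.tensor([2, 1], dtype=torch.int32)
    query_start_loc = torch.tensor([0, 1, 2], dtype=torch.int32)
    graph_pad_size = 4      # length of the padded input_ids
    num_actual_tokens = 2   # tokens produced by the real decode requests

    padded_num_tokens = graph_pad_size - num_actual_tokens

    # Each padded token becomes a dummy decode request of sequence length 1 ...
    seq_lens = torch.cat([
        seq_lens,
        torch.ones(padded_num_tokens, dtype=seq_lens.dtype)
    ])
    # ... and query_start_loc gains one entry per dummy request, each advancing by 1.
    start = int(query_start_loc[-1]) + 1
    query_start_loc = torch.cat([
        query_start_loc,
        torch.arange(start, start + padded_num_tokens, dtype=query_start_loc.dtype)
    ])

    print(seq_lens.tolist())         # [2, 1, 1, 1]
    print(query_start_loc.tolist())  # [0, 1, 2, 3, 4]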

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Signed-off-by: Angazenn <[email protected]>

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Fill in the PR description and write a clear commit message to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces padding logic for attention metadata, which is necessary for ACL graph capture with dynamic batch sizes. The changes in vllm_ascend/worker/model_runner_v1.py correctly control when this padding is applied. However, I've found a couple of critical issues in the padding implementation in vllm_ascend/attention/attention_v1.py. Specifically, there's an off-by-one error when padding query_start_loc_cpu, and query_lens is not being padded at all, which will lead to tensor shape mismatches and incorrect attention computation. I've provided a code suggestion to fix these issues.

Comment on lines 212 to 220
            padded_num_tokens = common_attn_metadata.graph_pad_size - num_actual_tokens
            seq_lens = torch.cat([seq_lens, torch.ones(padded_num_tokens, dtype=seq_lens.dtype, device=seq_lens.device)])
            block_table_padding = torch.zeros(
                (padded_num_tokens, ) + block_table.shape[1:],
                dtype=block_table.dtype,
                device=block_table.device)
            block_table = torch.cat([block_table, block_table_padding],
                                    dim=0)
            query_start_loc_cpu = torch.cat([query_start_loc_cpu, torch.arange(query_start_loc_cpu[-1] + 1, query_start_loc_cpu[-1] + padded_num_tokens, dtype=query_start_loc_cpu.dtype, device=query_start_loc_cpu.device)])
Contributor


critical

There are two issues in this padding logic that will cause problems during execution:

  1. query_start_loc_cpu padding is incorrect: The call to torch.arange has an off-by-one error. The end parameter is exclusive, so torch.arange(start, end) generates end - start elements. Your code generates padded_num_tokens - 1 elements, but you need to pad padded_num_tokens elements to match the padded seq_lens. This will cause a tensor size mismatch. The end argument should be query_start_loc_cpu[-1] + padded_num_tokens + 1.

  2. query_lens is not padded: You are padding seq_lens, block_table, and query_start_loc_cpu, but query_lens (calculated before this block) is not updated. This will lead to a mismatch in the number of sequences between query_lens and other padded tensors when creating AscendMetadata, causing errors downstream where query_lens is used. query_lens should also be padded with ones for the new sequences.

I've provided a suggestion to fix both issues.

            padded_num_tokens = common_attn_metadata.graph_pad_size - num_actual_tokens
            seq_lens = torch.cat([seq_lens, torch.ones(padded_num_tokens, dtype=seq_lens.dtype, device=seq_lens.device)])
            query_lens = torch.cat([query_lens, torch.ones(padded_num_tokens, dtype=query_lens.dtype, device=query_lens.device)])
            block_table_padding = torch.zeros(
                (padded_num_tokens, ) + block_table.shape[1:],
                dtype=block_table.dtype,
                device=block_table.device)
            block_table = torch.cat([block_table, block_table_padding],
                                    dim=0)
            new_query_locs = torch.arange(
                query_start_loc_cpu[-1] + 1,
                query_start_loc_cpu[-1] + padded_num_tokens + 1,
                dtype=query_start_loc_cpu.dtype,
                device=query_start_loc_cpu.device)
            query_start_loc_cpu = torch.cat([query_start_loc_cpu, new_query_locs])
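
As a quick check of the off-by-one described above (an illustrative snippet with made-up values, separate from the suggestion itself): torch.arange's end bound is exclusive, so arange(start, end) yields end - start elements.

    import torch

    query_start_loc_cpu = torch.tensor([0, 1, 2], dtype=torch.int32)  # made-up example values
    padded_num_tokens = 2
    start = int(query_start_loc_cpu[-1]) + 1

    # The original bound produces one element too few; adding 1 to the end fixes it.
    original = torch.arange(start, start + padded_num_tokens - 1)  # tensor([3]): one element short
    fixed = torch.arange(start, start + padded_num_tokens)         # tensor([3, 4]): padded_num_tokens elements

    assert fixed.numel() == padded_num_tokens
    # With the fix, query_start_loc keeps its usual invariant: one more entry than seq_lens.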

Signed-off-by: Angazenn <[email protected]>